Welcome to WUDAC Analytics 201, Fall 2020!
In this session, we will learn how to read in a dataset, clean it up, and manipulate it using a package called dplyr.
For this session, we’ll need just four packages:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
##
## intersect, setdiff, union
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(magrittr)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
If you get an error `there is no package called '<PACKAGE NAME>'` for any of these packages, it is probably because you don't have the package(s) installed: run install.packages("<PACKAGE NAME>") to install the missing package(s).
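If you'd rather not check each package by hand, a small sketch like the following installs any that are missing (it uses requireNamespace() to test whether a package is available without attaching it):

```r
# Check for the session's packages and install any that are missing
pkgs <- c("dplyr", "lubridate", "magrittr", "tidyr")
missing <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```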
We will be working with a wine dataset from the UCI Machine Learning Repository. Let’s see how to read it into your R workspace.
Click here to download the file containing the data.
# First, move the file to your desired folder location.
# Then, let's tell R where to find the file.
getwd() # this prints the current working directory
## [1] "/Users/lucy/repos/wudac/WUDAC_R201"
# If this is not where you want to save your work, change your path using the command below:
#setwd("C:/your/desired/path")
Once the file is in the right location, we can read in the dataset.
# The file extension is .csv, so we use the classic read.csv() command
data <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
### Notice the "sep=';'" argument? That tells R to recognize ";" as the delimiter (typically the delimiter is a comma, so we don't have to specify it).
Now we have our data stored in a variable, so we are ready to clean it up.
In the real data world, it's rare that every piece of information you want is stored in a single dataframe. Usually, different concepts are stored in different dataframes, and there are many good reasons for organizing data this way.
Hence, when we have two (or more) datasets, each storing pieces of information we want, join is often used to combine them.
print(students)
## ID Name Grade MajorID
## 1 12345678 John Sophomore 4
## 2 48261849 Aaron Junior 2
## 3 28475617 Sally Junior 3
## 4 88904626 Zach Freshman 3
## 5 12375692 Cassy Senior 1
## 6 10475628 Jim Senior 0
print(majors)
## orID Major_Name
## 1 0 Undecided
## 2 1 ENGL
## 3 2 STAT
## 4 3 CIS
## 5 4 MGMT
## 6 5 EALC
## 7 6 MKTG
Say we want to find out what major each student has. We can't get the data from any single dataset, so we need to put the two datasets together. A join statement needs 3 key pieces of information: the two tables being joined, the type of join, and the column(s) to join on.
The syntax for join in dplyr is: joined_data <- type_of_join(df1, df2, by = "name of the column").
If the column(s) you're joining on have different names in the left- and right-hand tables, like we have here, you can pass a named vector to the by argument: by = c('<LEFT HAND COLUMN NAME>' = '<RIGHT HAND COLUMN NAME>'):
joined_data <- inner_join(students, majors, by = c('MajorID' = 'orID'))
print(joined_data)
## ID Name Grade MajorID Major_Name
## 1 12345678 John Sophomore 4 MGMT
## 2 48261849 Aaron Junior 2 STAT
## 3 28475617 Sally Junior 3 CIS
## 4 88904626 Zach Freshman 3 CIS
## 5 12375692 Cassy Senior 1 ENGL
## 6 10475628 Jim Senior 0 Undecided
Combining with the select statement we discussed earlier, we can create a dataset with only the student name and their major:
student_major <- select(joined_data, c("Name", "Major_Name"))
print(student_major)
## Name Major_Name
## 1 John MGMT
## 2 Aaron STAT
## 3 Sally CIS
## 4 Zach CIS
## 5 Cassy ENGL
## 6 Jim Undecided
There are 4 common types of joins, which are:
inner_join: Returns records that have matching values in both tables
left_join: Returns all records from the left table, and the matched records from the right table
right_join: Returns all records from the right table, and the matched records from the left table
full_join: Returns all records when there is a match in either the left or right table
It's important to recognize which type of join should be used in a query. For example, suppose we have a table of people and a table of allergies (listing specific allergy types and the people who have them). To find out what kind of allergies each person has, it's probably wise to do a LEFT JOIN from people to allergies (i.e., keeping all records in people), since many people won't have an allergy and so won't show up in the allergies table.
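To make the people/allergies example concrete, here is a small sketch with made-up data (the people and allergies tables and their column names are hypothetical):

```r
library(dplyr)

# Hypothetical people/allergies tables (names made up for illustration)
people <- data.frame(person = c("Ana", "Ben", "Cam"))
allergies <- data.frame(person = c("Ana", "Cam"),
                        allergy = c("peanuts", "shellfish"))

# left_join keeps every person; Ben gets NA since he has no allergy record
left_join(people, allergies, by = "person")
```

With an inner_join instead, Ben would disappear from the result entirely, which is usually not what we want here.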
Merge has basically the same functionality as join. It comes from base R rather than dplyr (in other words, you don't need to load any additional packages to use it). However, join is designed to be faster and more efficient than merge, so it's generally recommended to use join over merge.
General merge syntax: merged_data <- merge(df1, df2, by = "name of column to join on", <arguments controlling the type of join>).
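For instance, the students/majors inner join from earlier could be reproduced with merge(); since the key columns have different names, we use the by.x/by.y arguments (sketched here with cut-down versions of those tables):

```r
# Cut-down versions of the students/majors tables from above
students <- data.frame(Name = c("John", "Aaron"), MajorID = c(4, 2))
majors <- data.frame(orID = c(2, 4), Major_Name = c("STAT", "MGMT"))

# Same result as inner_join(students, majors, by = c("MajorID" = "orID"));
# by.x/by.y handle key columns with different names
merged_data <- merge(students, majors, by.x = "MajorID", by.y = "orID")
```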
More specifically, we have:
inner_join = merge(df1, df2, by = "colx")
full_join = merge(df1, df2, by = "colx", all = TRUE)
left_join = merge(df1, df2, by = "colx", all.x = TRUE)
right_join = merge(df1, df2, by = "colx", all.y = TRUE)
Suppose we want to see the average pH of all wines in the dataset. Then we can simply use
mean(data$pH)
## [1] 3.188267
But what if we want to compute average pH for each wine quality rating? We need a couple important functions: group_by() and summarize(), which are both contained within the dplyr library.
data %>%
group_by(quality) %>%
summarize(mean_pH = mean(pH))
As we can see above, group_by() creates a grouped table for each unique quality value. We then have to define how we want to summarize the values in each group. In this case, we used the mean() function, but we also could have used sum():
data %>%
group_by(quality) %>%
summarize(total_pH = sum(pH))
Or standard deviation:
data %>%
group_by(quality) %>%
summarize(stddev_pH = sd(pH))
As long as our desired grouping is the same, we can create multiple summary columns at once; for example, say we want to look at both mean pH and mean density by quality:
data %>%
group_by(quality) %>%
summarize(mean_pH = mean(pH), mean_density = mean(density))
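Another handy summary function is n(), which counts the rows in each group. For example, to see how many wines received each quality rating alongside the mean pH (sketched here with a toy data frame standing in for the wine data):

```r
library(dplyr)

# Toy stand-in for the wine data
wine <- data.frame(quality = c(5, 5, 6, 6, 6),
                   pH      = c(3.1, 3.2, 3.0, 3.3, 3.3))

wine %>%
  group_by(quality) %>%
  summarize(n_wines = n(), mean_pH = mean(pH))
```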
Finally, here’s a diagram explaining how group_by() works graphically.
Source: datasciencemadesimple.com
# Load Data
arrests = read.csv('arrestssample.csv') # TODO: Make this the actual arrests data
Sampling data
If our data set is too large, it becomes unwieldy to analyze even using R. To avoid this, we can often randomly sample the data to draw valid conclusions with less computing power.
We can sample by number of rows using sample_n() or by percentage of rows using sample_frac():
arrests.sample = sample_n(arrests, size = 9000) # Specific number of rows; we'll use this dataset moving forward
arrests.samplefrac = sample_frac(arrests, 0.01) # Specific fraction of rows
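One note: sample_n() and sample_frac() draw randomly, so each run gives a different sample. If you want reproducible results, call set.seed() first (shown here on a toy data frame standing in for the arrests data):

```r
library(dplyr)

toy <- data.frame(x = 1:100)  # stand-in for the arrests data

set.seed(42)              # fix the random number generator
s1 <- sample_n(toy, 10)
set.seed(42)              # same seed...
s2 <- sample_n(toy, 10)   # ...same sample
```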
Removing null values
Many real-world datasets contain null values. We usually need to deal with these before analyzing the data. One common option is to remove them, for which we can use the is.na() function:
# Drop all rows with NA values for sub_race
arrests.sample <- arrests.sample %>% filter(!is.na(sub_race))
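Since we loaded tidyr earlier, its drop_na() function is an equivalent alternative; a quick sketch on a toy data frame (reusing the sub_race column name for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data frame reusing the sub_race column name for illustration
toy <- data.frame(sub_race = c("A", NA, "B"), hour = c(1, 2, 3))

filtered <- toy %>% filter(!is.na(sub_race))  # the approach above
dropped  <- toy %>% drop_na(sub_race)         # tidyr equivalent
```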
Filtering data
We often only want to look at rows of the dataset that meet particular conditions. As shown below, we can use the filter() dplyr function to filter the data:
# Arrests occurring from noon up until 8 PM
arrests.afternoon = arrests.sample %>% filter(hour >= 12 & hour < 20)
# Arrests occurring from 5AM up until noon
arrests.morning = arrests.sample %>% filter(hour >= 5 & hour < 12)
# Arrests with the "violent" or "property" crime codes
arrests.violent.property = arrests.sample %>% filter(crime_code %in% c("violent", "property"))
Combining/separating columns
We may also want to combine columns, which we can do using the unite() function:
# Combine race and gender columns
arrests.sample = unite(arrests.sample, sub_race, sub_gender, col = "race_and_gender", sep = " ")
arrests.sample %>% group_by(race_and_gender) %>% count()
Or conversely, we may want to separate columns, which we can do using the separate() function:
# Split this column back into the original two columns
arrests.sample = separate(arrests.sample, race_and_gender, sep = " ", into = c("sub_race", "sub_gender"))
Renaming columns
We can use the rename() function from dplyr to rename columns:
# Rename off_id to officer_id
rename(arrests.sample, officer_id = off_id) # Syntax is new name = old name
Creating new columns
We can use the mutate() function from dplyr to create new columns based on existing columns:
# Create new is_afternoon column, which is 1 if the arrest occurred in the afternoon and 0 otherwise
arrests.sample = arrests.sample %>% mutate(is_afternoon = if_else(hour >= 12, 1, 0))
Formatting dates
One thing that may be particularly useful to you in your final analysis is date formatting. In the raw data, the dates are in the form of a character vector, but we might be interested in filtering based on month, year, etc., so we'll need to transform the dates into something that is easier to work with in R.
Here, we’ll use the mdy() function from the lubridate package we loaded earlier to parse these dates. We use mdy() because the dates are formatted MM/DD/YY in the data, but we might need to use a different function from the package for dates that are formatted differently (for example, we would use dmy() to parse dates of the form DD/MM/YY).
We can then use the month() function from the lubridate package to get the month from the date. lubridate also has year(), wday() (to get the day of the week), and other functions to get the various components of a date that may be useful for you.
ArrestDates = mdy(arrests.sample$date) #Convert the data into the date class
head(month(ArrestDates), 10) #Find the month that the arrests occurred in
## [1] 8 1 7 1 11 5 12 8 11 10
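The same pattern works for lubridate's other accessors; for example, on a couple of hand-made dates:

```r
library(lubridate)

d <- mdy(c("08/15/16", "01/03/17"))  # made-up MM/DD/YY dates
month(d)                 # 8 1
year(d)                  # 2016 2017
wday(d, label = TRUE)    # day of the week as a labeled factor
```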
Finally, here are a couple more examples combining group_by() with count() to organize the data:
# A good way to organize data is with dplyr
arrests.hour.race = arrests.sample %>% group_by(hour, sub_race) %>% count()
arrests.hour.race
arrests.violent.property %>% group_by(crime_code, sub_gender) %>% count()